library(tidyverse)
data <- read.csv("data.csv")
This dataset comes from the UC Machine Learning Repository Mice Protein Expression. It consists of expression levels of 77 proteins modifications that produced signals in the nuclear fraction of the cortex. The dataset includes measurements from 38 control mice and 34 trisomic mice (Down syndrome), as discussed in the study by @doi:10.1371/journal.pone.0129126.
data
Mouse ID: Unique identifier for each mouse, 15 measurements were taken for each protein per mouse.
Expression Levels: The dataset contains values for the expression levels of 77 proteins.
Genotype: The genotype of each mouse, either control (c) or trisomy (t).
Treatment Type: The treatment administered to the mouse, either memantine (m) or saline (s).
Behavior: Describes the behavioral context of each mouse, either context-shock (CS) or shock-context (SC).
The mice are classified into eight groups based on features such as genotype, behavior, and treatment.
blue_palette <- colorRampPalette(c("darkblue", "lightblue"))(8)
ggplot(data, aes(class)) +
geom_bar(aes(fill = class), alpha = 0.8) +
scale_fill_manual(values = blue_palette) +
labs(title = "Number of Classes", y = "Count", x="Class")
c-CS-s: Control mice, stimulated to learn, injected with saline (9 mice)
c-CS-m: Control mice, stimulated to learn, injected with memantine (10 mice)
c-SC-s: Control mice, not stimulated to learn, injected with saline (9 mice)
c-SC-m: Control mice, not stimulated to learn, injected with memantine (10 mice)
t-CS-s: Trisomic mice, stimulated to learn, injected with saline (7 mice)
t-CS-m: Trisomic mice, stimulated to learn, injected with memantine (9 mice)
t-SC-s: Trisomic mice, not stimulated to learn, injected with saline (9 mice)
t-SC-m: Trisomic mice, not stimulated to learn, injected with memantine (9 mice)
The genotypes can either be control or trisomic.
ggplot(data, aes(x = Genotype, fill = class)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = blue_palette) +
labs(title = "Classes for feature: Genotype", x = "Class", y = "Frequency") +
theme_minimal() +
theme(legend.position = c(1.01, 0.8), legend.justification = c(0, 0.5))
In terms of behavior, some mice have been stimulated to learn (context-shock), while others have not (shock-context).
ggplot(data, aes(x = Behavior, fill = class)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = blue_palette) +
labs(title = "Classes for feature: Behavior", x = "Class", y = "Frequency") +
theme_minimal() +
theme(legend.position = c(1.01, 0.8), legend.justification = c(0, 0.5))
In order to assess the effect of the drug memantine on the ability to learn in trisomic mice, some mice were injected with the drug, and others were injected with saline as control.
ggplot(data, aes(x = Treatment, fill = class)) +
geom_bar(position = "dodge") +
scale_fill_manual(values = blue_palette) +
labs(title = "Classes for feature: Treatment", x = "Class", y = "Frequency") +
theme_minimal() +
theme(legend.position = c(1.01, 0.8), legend.justification = c(0, 0.5))
The dataset contains some missing values. In this case, we will handle the missing values by replacing them with the mean of the corresponding protein group.
missing_values <- data %>%
setNames(gsub("_N", "", names(.))) %>%
summarise(across(everything(), ~ sum(is.na(.)), .names = "{.col}")) %>%
pivot_longer(everything(), names_to = "key", values_to = "missing") %>%
mutate(
total = nrow(data),
pct = (missing / total) * 100,
isna = missing > 0
) %>%
filter(isna) %>%
arrange(desc(pct))
percentage.plot <- missing_values %>%
ggplot(aes(x = reorder(key, pct), y = pct)) +
geom_bar(stat = 'identity', fill = 'blue', alpha = 1) +
scale_x_discrete(limits = missing_values$key) +
coord_flip() +
labs(title = "Missing Values In the Dataset", x = 'Protein Name', y = "Percentage of Missing Values") +
theme(plot.title = element_text(hjust = 0.5)) +
guides(fill = "none")
percentage.plot
data <- data %>%
group_by(class) %>%
mutate(across(everything(), ~ replace(., is.na(.), mean(., na.rm = TRUE)))) %>%
as.data.frame()
names(data) <- gsub("_N", "", names(data))
proteins <- names(data[2:78])
classes <- as.vector(unique(as.character(data$class)))
This histogram displays the distribution of protein expression levels across all samples. It shows the frequency of expression values within specified intervals (bins), allowing for an understanding of the overall data distribution. The shape of the histogram reveals patterns such as skewness, central tendency, and the spread of expression levels, providing insights into how proteins are expressed in the dataset. This visualization helps identify any potential trends, outliers, or clusters in the expression levels.
for(i in 1:length(proteins)){
plot <- ggplot(data, aes(eval(parse(text = proteins[i])))) +
geom_histogram(aes(fill=..count..), color = "black", alpha = 0.5) +
scale_fill_gradient("Count", low="blue", high="black") +
labs(title = proteins[i],
x = "Protein Expression Level",
y = "Frequency") +
theme_light()
cat("#### ", proteins[i], "\n")
print(plot)
cat('\n\n')
}
This boxplot illustrates the distribution of protein expression levels across different classes. Each box represents the range of expression values for a specific protein, showing the median, interquartile range (IQR), and potential outliers. The plot allows for easy comparison of protein expression patterns between classes, highlighting variations and trends in the data. By visualizing the spread and central tendency of expression levels, this boxplot helps identify proteins with significantly different expression across the classes.
for(i in 1:length(proteins)){
plot <- ggplot(data, aes(x = class, y = eval(parse(text = proteins[i])))) +
geom_boxplot(aes(fill = class), alpha = 0.8) +
stat_summary(fun.y=mean, colour="black", geom="point", size=1.5) +
labs(y = proteins[i]) +
scale_fill_manual(values = blue_palette)
cat("#### ", proteins[i], "\n")
print(plot)
cat('\n\n')
}
This PCA plot visualizes the distribution of samples from eight distinct classes based on the first two principal components. Each point represents a sample, and the colors indicate the class to which the sample belongs. The plot provides insights into how the samples are grouped and their variation along the principal components, which can be useful for identifying patterns or clustering trends within the data.
library(ggfortify)
library(plotly)
df <- data[2:78]
pca_res <- prcomp(df, scale. = TRUE, rank. = 2)
pca_plot <- autoplot(pca_res, data = data, colour = 'class', frame = TRUE, frame.type = 'norm')
interactive_pca <- ggplotly(pca_plot, tooltip = "text")
interactive_pca <- interactive_pca %>%
layout(
title = "PCA Analysis of Protein Expression by Class",
hovermode = "closest",
margin = list(
l = 80,
r = 80,
b = 80,
t = 80
))
interactive_pca
This analysis involves examining differences in expression levels across genotypes using ANOVA, followed by Tukey’s Honest Significant Difference (HSD) test for pairwise comparisons. Due to file size limit, this part is now moved to separate file.
Based on the result the decision tree has a really bad fit and low accuracy. Due to the file size limit this code and results is now moved to separate python file on git.
library(caret)
library(rpart)
library(randomForest)
data$class <- as.factor(data$class)
X <- data[ , -which(names(data) %in% c("class"))]
y <- data$class
m <- ncol(data)
dt_model <- train(class ~ ., data = na.omit(data[ , -((m-3):(m-1))]), method = 'rpart',
trControl = trainControl(method = "cv", number = 10))
It appears that tuning mtry to 48 provides the most stable and accurate model performance in this grid search. Due to the file size limit this code and results is now moved to separate python file on git.
The comparison of control mice that successfully learned (c-CS-s and c-CS-m) to those that did not undergo training (c-SC-s and c-SC-m) identified several proteins significantly associated with learning. A total of 31 proteins showed significant differences between the groups of mice that underwent successful learning versus those that did not. Proteins such as BRAF, ERK, and pERK, components of the MAPK signaling pathway, were identified. This pathway is well-established in its role in synaptic plasticity and learning. Immediate early gene proteins like EGR1 and BDNF were found to be elevated in the successful learning groups.
When comparing successful learning with saline treatment (c-CS-s) to the non-stimulated control mice (c-SC-s), there were 24 proteins that were significantly different, including DYRK1A, SOD1, and ITSN1, all of which were previously implicated in learning and memory functions. But in Ts65Dn mice, learning failure (c-CS-s) was associated with a distinct protein signature, with proteins like APP and DYRK1A showing upregulated expression, which are known to be linked to cognitive deficits in Down syndrome.
When comparing t-CS-s (failed learning) vs. t-SC-s (non-learning), there were 10 proteins with significant expression differences, including DYRK1A, ITSN1, pERK, BRAF, and SOD1. Some of these proteins were also altered in normal learning of control mice, indicating that even in failed learning, trisomic mice exhibited some normal protein responses.
When comparing t-CS-m (rescued learning) vs. t-SC-m (non-learning with memantine), 36 proteins were significantly different, with 17 showing similar changes to those in normal control mice, suggesting that memantine promotes more normal responses but not entirely. A similar comparison of t-CS-m vs. t-SC-s (non-learning with saline)identified 20 proteins, further showing how memantine rescued learning by altering protein profiles.
Memantine alone (t-SC-m vs. t-SC-s) influenced 12 proteins, such as pNR2A, pPKCAB, and ARC, some of which were also present in control mice. The comparison of rescued (t-CS-m) vs. failed learning (t-CS-s) showed only 9 significant proteins, indicating that while memantine could rescue some aspects of learning, it didn’t fully restore normal protein expression levels.